Data Processing and Visualization


Michael Clark
Statistician Lead


2016-08-13

Outline

Part 1

  • Overview of Data Structures
  • Input/Output
  • Vectorization and Apply functions

Part 2

  • Pipes, and how to use them
  • plyr, dplyr, tidyr
  • data.table

Part 3

  • Visualization with ggplot2
  • Adding Interactivity

Part 1

Data Structures

Data structures

R has several core data structures:

  • Vectors
    • Factors
  • Lists
  • Matrices/arrays
  • Data frames

Vectors

Vectors form the basis of R data structures.

There are two main types, atomic vectors and lists, but I will treat lists separately.

Here is an R vector. The elements of the vector are numeric values.

x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4

Vectors

All elements of an atomic vector are the same type. Examples include:

  • characters
  • numeric (double)
  • integer
  • logical
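A minimal sketch of each type; typeof reveals the underlying storage mode:

```r
chr = c('a', 'b')       # character
dbl = c(1.5, 2.5)       # numeric (double)
int = c(1L, 2L)         # integer (note the L suffix)
lgl = c(TRUE, FALSE)    # logical

typeof(dbl)   # "double"
typeof(int)   # "integer"
```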

Factors

An important type of vector is the factor.

Factors are used to represent categorical data structures.

x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q              V              what the heck?
Levels: q V what the heck?

Factors

The underlying representation is numeric.

But, factors are categorical.

They can’t be used as numbers would be.

as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(structure(1:3, .Label = c("q", "V", "what the heck?"), class = "factor")) : 'sum' not meaningful for factors
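A common gotcha follows from this: for a factor whose labels are numbers, as.numeric returns the underlying codes, not the labels. Convert via character first. A minimal sketch:

```r
f = factor(c('10', '20', '30'))

as.numeric(f)                  # 1 2 3: the underlying codes
as.numeric(as.character(f))    # 10 20 30: the labels as numbers
```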

Matrices

With multiple dimensions, we are dealing with arrays.

Matrices are 2-d arrays, and are very commonly used.

The vectors making up a matrix must all be of the same type.

  • e.g. all values in a matrix must be numeric.

Creating a matrix

Creating a matrix can be done in a variety of ways.

# create vectors
x = 1:4
y = 5:8
z = 9:12

rbind(x, y, z)   # row bind
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    5    6    7    8
z    9   10   11   12
cbind(x, y, z)   # column bind
     x y  z
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Lists

Lists in R are highly flexible objects.

They can contain anything as their elements, even other lists.

  • unlike vectors, whose elements must be of the same type.

Here is a list. We use the list function to create one.

x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] "cat"

Lists

We often want to loop some function over a list.

for (elem in x) print(class(elem))
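The same idea written with lapply (covered below) returns the results, rather than just computing them:

```r
x = list(1, "apple", list(3, "cat"))

lapply(x, class)   # a list: "numeric", "character", "list"
```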

Lists can, and often do, have named elements.

x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]
$b
[1] -1

Data Frames

The data.frame is a very commonly used data structure.

Its columns do not have to be of the same type.

This is because the data.frame class is actually just a list.

As such, everything about lists applies to data.frames.

But they can also be indexed by row or column

  • like matrices.

Creating a data frame

mydf = data.frame(a = c(1,5,2),
                  b = c(3,8,1))

We can add row names also.

rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1

Input/Output

Input/Output

Standard methods of reading in data

  • read.table
  • read.csv
  • readLines

Using the foreign package:

  • read.spss
  • read.xport

Note: the foreign package does not support recent Stata file formats.

Newer approaches

haven: Package to read in foreign statistical files

  • read_spss
  • read_dta

readxl: for Excel files

Faster approaches

readr: Faster versions of base R functions

  • read_csv
  • read_delim

These infer column types from an initial scan of the data.

If you don’t have ‘big’ data, this won’t help much.

However, they actually can be used as a diagnostic.

  • pick up potential data entry errors.

Faster approaches

data.table: faster read.table

  • fread

Typically faster than readr approaches.

Other Data

Note that R can handle many types of data.

Some examples:

  • JSON
  • SQL
  • XML
  • YAML
  • MongoDB
  • NETCDF
  • text (e.g. a novel)
  • shapefiles
  • google spreadsheets

And many, many others.

On the horizon

feather: designed to make reading and writing data frames efficient

Works in both Python and R.

Still in early stages of development.

Indexing

Base R Indexing Refresher

Slicing vectors

letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"

Slicing matrices/data.frames

myMatrix[1, 2:3]

Base R Indexing Refresher

Label-based indexing:

mydf['row1', 'b']

Position-based indexing:

mydf[1, 2]

Base R Indexing Refresher

Mixed indexing:

mydf['row1', 2]

If the row/column value is empty, all rows/columns are retained.

mydf['row1',]
mydf[,'b']

Base R Indexing Refresher

Non-contiguous:

mydf[c(1,3),]

Boolean:

mydf[mydf$a >=2,]

Base R Indexing Refresher

List/Data.frame extraction

[ : grab a slice of elements/columns

[[ : grab specific elements/columns

$ : grab specific elements/columns

List/Data.frame extraction

my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name

Vectorization

Boolean Indexing

Logicals are objects with values of TRUE or FALSE.

Assume x is a vector of numbers.

idx = x > 2
idx
x[idx]

Flexibility

We don’t have to create a Boolean object before using it.

R indexing is ridiculously flexible.

x[x > 2]
x[x != 3]
x[ifelse(x > 2, T, F)]
x[{y = idx; y}]

Vectorized operations

Consider the following loop:

for (i in 1:nrow(mydf)) {
  check = mydf$x[i] > 2
  if (check==TRUE){
    mydf$y[i] = 'Yes'
  } else {
    mydf$y[i] = 'No'
  }
}

Vectorized operations

Compare:

mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'

This gets us the same thing, and would be much faster.

Vectorized operations

Boolean indexing is an example of a vectorized operation.

The whole vector is considered.

  • Rather than each element individually

This is almost always faster.

Vectorized operations

Log all values in a matrix.

mymatrix_log = log(mymatrix)

Way faster than looping over elements, rows or columns.

Vectorized Operations

Many vectorized functions already exist in R.

They are often written in C, Fortran etc., and so even faster.

Apply functions

A family of functions allows for a succinct way of looping.

Common ones include:

  • apply
  • lapply, sapply, vapply
  • tapply
  • mapply
  • replicate

Apply functions

  • apply
    • arrays, matrices, data.frames
  • lapply, sapply, vapply
    • lists, data.frames, vectors
  • tapply
    • grouped operations (table apply)
  • mapply
    • multivariate version of sapply
  • replicate
    • similar to sapply
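Minimal sketches of a few of these on toy data:

```r
m = matrix(1:6, nrow=2)

apply(m, 1, sum)    # row sums: 9 12
apply(m, 2, sum)    # column sums: 3 7 11

sapply(1:3, sqrt)   # simplified to a vector; lapply would return a list

tapply(mtcars$mpg, mtcars$cyl, mean)   # mean mpg for each cylinder count
```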

Example

Standardizing variables.

for (i in 1:ncol(mydf)){
  x = mydf[,i]
  m = mean(x); s = sd(x)   # compute these first, before overwriting x
  for (j in 1:length(x)){
    x[j] = (x[j] - m)/s
  }
  mydf[,i] = x
}

The above would be a really bad way to use R.

stdize <- function(x) {
  (x-mean(x))/sd(x)
}

apply(mydf, 2, stdize)

Timings

The previous demonstrates how to use apply.

However, there is a scale function in base R.

Unit: milliseconds
       expr        min          lq        mean     median         uq        max neval
 doubleloop 3112.41884 3130.286411 3198.740874 3144.29025 3227.45382 3663.90853    25
 singleloop   31.59734   32.022406   33.439865   32.69933   34.70190   38.34710    25
       plyr  132.64410  133.432096  139.466588  134.74993  136.99264  242.11898    25
      apply   33.99555   34.213489   35.698046   35.84892   36.95722   37.97816    25
   parApply   21.00966   21.769137   26.662488   22.58505   24.17335   72.32103    25
 vectorized    8.01776    8.635249    9.896537   10.31631   10.45710   13.24826    25
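For reference, the scale one-liner that replaces both loop versions above:

```r
mydf_std = scale(mydf)   # centers and scales each column; returns a matrix
```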

Apply functions

Benefits

  • Cleaner/simpler code
  • Potentially more reproducible
    • more likely to use generalizable functions
  • Parallelizable

NOT faster than explicit loops.

  • single loop over columns was as fast as apply
  • Replicate and mapply are especially slow

They can, however, potentially be made faster than loops:

  • Parallelization: parApply, parLapply etc.

Personal experience

I use R every day, and rarely use explicit loops.

  • Note: no speed difference for a for loop vs. using while
  • If you must use an explicit loop, preallocate an object and fill it in
    • Faster

I never use a double loop.

Apply functions

Apply functions should be a part of your regular R experience.

Other versions we’ll talk about have been optimized.

However, you need to know the basics in order to use those.

And you still may need parallel versions.

Part 2

Pipes

Note:

More detail on much of this part is given in another workshop.

Pipes

Operators that send what comes before to what comes after.

There are many different pipes.

There are many packages that use their own.

However, the vast majority of packages use the same pipe:

%>%

Pipes

Here, we’ll focus on their use with the dplyr package.

Later, we’ll use it for visualizations.

Example.

mydf %>% 
  select(var1, var2) %>% 
  filter(var1 == 'Yes') %>% 
  summary

Start with a data.frame %>%

    select columns from it %>%

    filter/subset it %>%

    get a summary

Using variables as they are created

We can use variables as soon as they are created.

mydf %>% 
  mutate(newvar1 = var1 + var2,
         newvar2 = newvar1/var3) %>% 
  summarise(newvar2avg = mean(newvar2))

Pipes for Visualization (more later)

Generic example.

basegraph %>% 
  points %>%
  lines %>%
  layout

The dot

Most functions are not ‘pipe-aware’ by default.

Example: pipe to a modeling function.

mydf %>% 
  lm(y ~ x)  # error

Other pipes can handle this.

  • e.g. %$% in magrittr

But generally, one can use a dot.

  • The dot refers to the object before the pipe.
mydf %>% 
  lm(y ~ x, data=.)

Flexibility

Piping is not just for data.frames.

  • The following starts with a character vector.
  • Sends it to a recursive function (named ..).
  • .. is created on-the-fly.
  • After the function is created, it’s used on ., representing the string.
  • Result: pipes between the words.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
  .. <-  . %>%
    if (length(.) == 1)  .
    else paste(.[1], '%>%', ..(.[-1]))
  ..(.)
} 
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
  • Put that in your pipe and smoke it, René Magritte!

Pipes

Pipes are best used interactively.

Extremely useful for data exploration.

Common in many visualization packages.

See the magrittr package for more pipes.

plyr, dplyr, tidyr

plyr

The original data management package of the three.

More general than dplyr.

Not as useful for most common operations, but contains:

  • more flexible versions of the apply family
  • some very useful functions not found elsewhere

plyr

adply, dlply etc.

  • First letter represents the current object (array, data.frame, list)
  • Second letter represents the returned object
library(plyr)
x = list(var1=1:5, var2=2:6)
ldply(x)
   .id V1 V2 V3 V4 V5
1 var1  1  2  3  4  5
2 var2  2  3  4  5  6
ldply(x, sum)
   .id V1
1 var1 15
2 var2 20

Option to parallelize.

plyr: some useful functions

*ply: apply style functions, with parallel capability

join_all: Recursively join a list of data frames

rbind.fill: row bind data.frames, filling in missing columns.

mapvalues/revalue: replace values

round_any: Round to multiple of any number.
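Sketches of two of these:

```r
library(plyr)

# mapvalues: replace specific values
mapvalues(c('a', 'b', 'c'), from = c('a', 'b'), to = c('A', 'B'))
# "A" "B" "c"

# round_any: round to a multiple of any number
round_any(137, 25)   # 125
```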

dplyr

Grammar of data manipulation.

Next iteration of plyr.

Focused on tools for working with data frames.

  • Over 100 functions

It has three main goals:

  • Make the most important data manipulation tasks easier.

  • Do them faster.

  • Use the same interface to work with data frames, data tables, or databases.

dplyr

Some key operations:

select: grab columns

  • select helpers: one_of, starts_with, num_range etc.

filter/slice: grab rows

group_by: grouped operations

mutate/transmute: create new variables

summarize: summarise/aggregate

do: arbitrary operations
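A sketch combining several of these on the built-in mtcars data:

```r
library(dplyr)

mtcars %>% 
  select(mpg, cyl, wt) %>%      # grab columns
  filter(wt < 5) %>%            # grab rows
  group_by(cyl) %>%             # grouped operations
  summarize(avg_mpg = mean(mpg), n = n())
```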

dplyr

Various join/merge functions.

Little things like:

  • n, n_distinct, nth, n_groups, count, recode, between

No need to quote variable names.

An example

Let’s say we want to select from our data the following variables:

  • Start with the ID variable
  • The variables X1:X10, which are not all together, and there are many more X columns
  • The variables var1 and var2, which are the only var variables in the data
  • Any variable that starts with XYZ

How might we go about this?

Some base R approaches

Tedious, or typically two steps just to get the columns you want.

# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]

# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]

# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=T))
newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)

More

What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.

An alternative

newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(ID, num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>% 
  top_n(n=50, wt=var2) %>% 
  arrange(desc(var1))

An alternative

dplyr and piping is an alternative

  • you can do all this sort of stuff with base R
  • with, within, subset, transform, etc.

Even though the base R approach depicted is fairly concise, it can still be:

  • noisier
  • less legible
  • less amenable to additional data changes
  • reliant on esoteric knowledge (e.g. regular expressions)
  • in need of new objects (even if we just want to explore)

tidyr

Two primary functions for manipulating data

  • gather: wide to long
  • spread: long to wide

Other useful functions include:

  • unite: paste together multiple columns into one
  • separate: complement of unite

Example

library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
                      X = rnorm(10, 0, 1),
                      Y = rnorm(10, 0, 2),
                      Z = rnorm(10, 0, 4) )
stocks %>% head
        time           X          Y         Z
1 2009-01-01  0.23465359 -1.9089778 -7.037391
2 2009-01-02  0.87151932  0.2355249 -2.090847
3 2009-01-03  0.03584969 -1.4706570  4.853836
4 2009-01-04 -0.74729694 -0.2460126  1.775394
5 2009-01-05 -0.45235779  1.2348015 -1.189270
6 2009-01-06 -0.49231946  1.6157330  3.669595
stocks %>% gather(stock, price, -time) %>% head
        time stock       price
1 2009-01-01     X  0.23465359
2 2009-01-02     X  0.87151932
3 2009-01-03     X  0.03584969
4 2009-01-04     X -0.74729694
5 2009-01-05     X -0.45235779
6 2009-01-06     X -0.49231946
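spread reverses the operation, and unite/separate work analogously on columns. A sketch continuing the stocks example:

```r
stocks_long = stocks %>% gather(stock, price, -time)

# long back to wide: one column per stock
stocks_long %>% spread(stock, price) %>% head

# paste two columns into one, then split them back apart
stocks_long %>% 
  unite(stock_time, stock, time) %>% 
  separate(stock_time, into = c('stock', 'time'), sep = '_') %>% 
  head
```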

Personal Opinion

The dplyr grammar is clear for a lot of standard data processing.

The best usage is for on-the-fly data exploration and visualization.

  • No need to create/overwrite existing objects
  • Can overwrite columns as they are created
  • Makes it easy to look at anything, and do otherwise tedious data checks

Drawbacks:

  • not as fast as data.table for many things
  • the mindset can make for unnecessary complication
    • e.g. no need to pipe etc. to create one new variable

On the horizon

multidplyr

Partitions the data across a cluster.

Can be faster than data.table (after partitioning).

Data.Table

data.table

data.table works in a notably different way than dplyr.

However, you’d use it for the same reasons.

Like dplyr, the data objects are both data.frames and a package specific class.

Faster subset, grouping, update, ordered joins and list columns

data.table

In general, data.table works with brackets as in base R.

However, the brackets work like a function call and have several arguments.

  • Key arguments
x[i, j, by, keyby, with = TRUE, ...]

Importantly, you can’t use the brackets as you would with data.frames.

library(data.table)
df = data.table(x=rnorm(6), g=1:3, y=runif(6))
df[,4]
[1] 4

data.table

Grab rows

df[1:3,]
            x g         y
1:  0.5069566 1 0.3005774
2: -0.6119176 2 0.4066323
3: -1.0640269 3 0.3077456

Grab columns.

df[,y]
[1] 0.30057740 0.40663225 0.30774555 0.21477488 0.09101447 0.34697773

data.table

Dropping columns is awkward.

This is because the second part of the brackets is the argument j, which is evaluated as an expression.

df[,-y]             
[1] -0.30057740 -0.40663225 -0.30774555 -0.21477488 -0.09101447 -0.34697773
df[,-'y', with=F]
            x g
1:  0.5069566 1
2: -0.6119176 2
3: -1.0640269 3
4: -0.2649322 1
5:  0.5482331 2
6:  0.1172551 3

Grouped operations

group-by, with creation of a new variable.

Note that := modifies a data.table in place, by reference.

df1 = copy(df)   # copy(), else := on df1 would also modify df and df2
df2 = copy(df)
df[,sum(x,y), by=g]
   g         V1
1: 1  0.7573766
2: 2  0.4339621
3: 3 -0.2920485
df1[,newvar := sum(x,y), by=g]
            x g          y     newvar
1:  0.5069566 1 0.30057740  0.7573766
2: -0.6119176 2 0.40663225  0.4339621
3: -1.0640269 3 0.30774555 -0.2920485
4: -0.2649322 1 0.21477488  0.7573766
5:  0.5482331 2 0.09101447  0.4339621
6:  0.1172551 3 0.34697773 -0.2920485
df1
            x g          y     newvar
1:  0.5069566 1 0.30057740  0.7573766
2: -0.6119176 2 0.40663225  0.4339621
3: -1.0640269 3 0.30774555 -0.2920485
4: -0.2649322 1 0.21477488  0.7573766
5:  0.5482331 2 0.09101447  0.4339621
6:  0.1172551 3 0.34697773 -0.2920485

We can also create groupings on the fly.

df2[,newvar := sum(x,y), by=g==1]
            x g          y    newvar
1:  0.5069566 1 0.30057740 0.7573766
2: -0.6119176 2 0.40663225 0.1419137
3: -1.0640269 3 0.30774555 0.1419137
4: -0.2649322 1 0.21477488 0.7573766
5:  0.5482331 2 0.09101447 0.1419137
6:  0.1172551 3 0.34697773 0.1419137
df2
            x g          y    newvar
1:  0.5069566 1 0.30057740 0.7573766
2: -0.6119176 2 0.40663225 0.1419137
3: -1.0640269 3 0.30774555 0.1419137
4: -0.2649322 1 0.21477488 0.7573766
5:  0.5482331 2 0.09101447 0.1419137
6:  0.1172551 3 0.34697773 0.1419137

Faster!

  • joins: fast and easy to do
df1[df2]
  • group operations: via setkey
  • reading files: fread
  • character matches: e.g. via chmatch
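A keyed join sketch with small illustrative tables:

```r
library(data.table)

dt1 = data.table(id = 1:3, x = c(10, 20, 30))
dt2 = data.table(id = 2:4, y = c('a', 'b', 'c'))

setkey(dt1, id)
setkey(dt2, id)

dt1[dt2]   # for each row of dt2, the matching rows of dt1 (ids 2, 3, 4)
```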

Timings

The following demonstrates some timings from here

  • Reproduced on my own machine
  • based on 50 million observations
  • Grouped operations are just a sum and length on a vector.

By the way, never, ever use aggregate. For anything.

          fun elapsed
1:  aggregate  114.35
2:         by   24.51
3:     sapply   11.62
4:     tapply   11.33
5:      dplyr   10.97
6:     lapply   10.65
7: data.table    2.71

Ever.

Really.

Pipe with data.table

It can be done, but is awkward at best.

mydf[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydf[,newvar:=mean(x), 
  ][,newvar2:=sum(newvar), by=group
  ][,-'y', with=FALSE
  ]

Probably better to just use a pipe and dot approach

mydf[,newvar:=mean(x),] %>% 
  .[,newvar2:=sum(newvar), by=group] %>% 
  .[,-'y', with=FALSE]

My take

Faster methods are great to have.

  • Especially for group-by and joins.

Drawbacks:

  • You may frequently forget it doesn’t work like a data.frame
    • lost programming time
  • The syntax can be awkward
  • While one can pipe to and from data.tables, piping with it requires more brackets

Compromise

If speed and/or memory is (potentially) a concern, data.table

For interactive exploration, dplyr

Piping allows one to use both, so no need to choose.

Part 3

ggplot2

ggplot2

ggplot2 is an extremely popular package for visualization in R.

  • and copied in other languages/programs

It entails a grammar of graphics.

  • Every graph is built from the same few parts

Key ideas:

  • Aesthetics
  • Layers (and geoms)
  • Piping
  • Facets
  • Themes
  • Extensions

ggplot2

Strengths:

  • Ease of getting a good looking plot
  • Easy customization
  • A lot of data processing is done for you
  • Clear syntax
  • Easy multidimensional approach
  • Equally spaced colors as a default

Aesthetics

Aesthetics allow one to map data to aesthetic aspects of the plot.

  • Size
  • Color
  • etc.

The function used in ggplot to do this is aes

aes(x=myvar, y=myvar2, color=myvar3, group=g)

Layers

In general, we start with a base layer and add to it.

In most cases you’ll start as follows.

ggplot(aes(x=myvar, y=myvar2), data=mydata)

This would not produce anything except for a plot background.

Piping

Layers are added via piping.

The first layers added are typically geoms:

  • points
  • lines
  • density
  • text

ggplot2 was using pipes before it was cool, and so it has a different pipe.

Otherwise, the concept is the same as before.

ggplot(aes(x=myvar, y=myvar2), data=mydata) +
  geom_point()

And now we would have a scatterplot.

Examples

library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point()

Examples

ggplot(aes(x=date, y=unemploy), data=economics) +
  geom_line()

Examples

In the following, one setting (alpha) is set directly rather than mapped to the data.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(size=carat, color=clarity), alpha=.25) 

Stats

There are many statistical functions built in.

One of the key strengths of ggplot is that you don’t have to do much preprocessing.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_quantile()

Stats

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth()

Stats

ggplot(mtcars, aes(cyl, mpg)) + 
  geom_point() +
  stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)

Facets

Facets allow for panelled display, a very common operation.

In general, we often want comparison plots.

facet_grid will produce a grid.

  • Often this is all that’s needed

facet_wrap is more flexible.

Both use a formula approach to specify the grouping.

facet_grid

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)

facet_wrap

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)

Fine control

ggplot2 makes it easy to get good looking graphs quickly.

However the amount of fine control is extensive.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(color=clarity), alpha=.5) + 
  scale_y_log10(breaks=c(1000,5000,10000)) +
  xlim(0, 10) +
  scale_color_brewer(type='div') +
  facet_wrap(~cut, ncol=3) +
  theme_minimal() +
  theme(axis.ticks.x=element_line(color='darkred'),
        axis.text.x=element_text(angle=-45),
        axis.text.y=element_text(size=20),
        strip.text=element_text(color='forestgreen'),
        strip.background=element_blank(),
        panel.grid.minor=element_line(color='blue'),
        legend.key=element_rect(linetype=4),
        legend.position='bottom')



Themes

In the last example you saw two uses of a theme.

  • built-in
  • specific customizations

Each argument takes on a specific value or an element function:

  • element_rect
  • element_line
  • element_text
  • element_blank

Themes

The base theme is not too good.

  • not for web
  • doesn’t look good for print either

You will almost invariably need to tweak it.

Extensions

While many contributed before, ggplot2 now has an extension system.

There is even a website to track the extensions.

Examples include:

  • additional themes
  • interactivity
  • animations
  • marginal plots
  • network graphs

Summary ggplot2

ggplot2 is an easy to use, but powerful visualization tool.

Allows one to think in many dimensions for any graph:

  • x
  • y
  • color
  • size
  • opacity
  • facet

2d graphs are not useful for conveying anything but the simplest ideas.

Use ggplot2 to easily go beyond 2d for interesting visualizations.

Packages

ggplot2 is the most widely used package for visualization in R.

However, it is not interactive by default.

Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.

Packages

General:

  • plotly
    • also used in Python, Matlab, and Julia; aside from many interactive plot types, it can convert ggplot2 figures to interactive ones.
  • ggvis
    • interactive successor to ggplot2, though not currently under active development
  • rbokeh
    • like plotly, it also has cross program support

Specific functionality:

  • DT
    • interactive data tables
  • leaflet
    • maps with OpenStreetMap
  • dygraphs
    • time series visualization
  • visNetwork
    • Network visualization

Piping for Visualization

One of the advantages to piping is that it’s not limited to dplyr style data management functions.

Any R function can be potentially piped to, and we’ve seen several examples so far.

This facilitates data exploration, especially visually.

  • don’t have to create objects
  • new variables are easily created and subsequently manipulated just for vis
  • data manipulation not separated from visualization

htmlwidgets

Many newer visualization packages take advantage of piping as well.

htmlwidgets is a package that makes it easy to use R to create javascript visualizations.

  • i.e. what you see everywhere on the web.

The packages using it typically are pipe-oriented and produce interactive plots.

plotly example

A couple demonstrations with plotly.

Note the layering as with ggplot2.

Piping used before plotting.

library(plotly)
midwest %>% 
  filter(inmetro==T) %>% 
  plot_ly(x=percollege, y=percbelowpoverty, mode='markers') 

plotly example

Plotly has modes, which allow for points, lines, text and combinations.

Traces work similar to geoms.

library(mgcv)

mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  plot_ly(x=wt, y=mpg, color=amFactor, width=800, height=500, mode='markers') %>% 
  add_trace(x=wt, y=prediction, alpha=.5, hover=hovertext, name='gam prediction')

ggplotly

The nice thing about plotly is that we can feed a ggplot to it.

It would have been easier to use geom_smooth, so let’s do so.

gp = mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  ggplot(aes(x=wt, y=mpg)) +
  geom_smooth() +
  geom_point(aes(color=amFactor))
ggplotly(gp)

dygraphs

Dygraphs are useful for time-series.

library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>% 
  dygraph(width=800) %>% 
  dyOptions(stackedGraph = TRUE, colors=RColorBrewer::brewer.pal(3, name='Dark2')) %>%
  dyRangeSelector(height = 20)

visNetwork

library(visNetwork)
visNetwork(nodes, edges, height=600, width=800) %>% 
  visNodes(shape='circle', 
           font=list(), 
           scaling=list(min=10, max=50, label=list(enable=T))) %>% 
  visLegend()

data table

library(DT)
movies %>% 
  select(1:6) %>% 
  filter(rating>9) %>% 
  slice(sample(1:nrow(.), 50)) %>% 
  datatable(rownames=F)

Shiny

Shiny is a framework that essentially allows you to build an interactive website.

  • Provided by RStudio developers

Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.

Interactive and Visual Data Exploration

Interactivity allows for even more dimensions to be brought to a graphic.

Interactive graphics are more fun too!

Just a couple visualization packages can go a very long way.

Summary